-
Notifications
You must be signed in to change notification settings - Fork 452
OCPBUGS-62341: Ensure the node passed to RunCordonOrUncordon comes from the latest updated state #5305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-62341: Ensure the node passed to RunCordonOrUncordon comes from the latest updated state #5305
Conversation
Hi @chizhang21. Thanks for your PR. I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Hi @yuqi-zhang @pablintino , could you please take a look at this PR? I’m available for any follow-ups. Thanks! |
} | ||
|
||
// make sure local node's Unschedulable state is updated | ||
node.Spec.Unschedulable = updatedNode.Spec.Unschedulable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I think what you propose does make sense from a functional point of view, I have a few comments:
- The node variable is not local, we are modifiying a reference thus changing it for the caller too, and I wouldn't be expecting this function to do write to the reference I pass. The underlaying cordon/uncordon logic only makes reads, and that's what I'd expect from this function too.
- From the previous point. Why aren't us consuming inside the ExponentialBackoff the updated reference always? In the first loop it can be initialized with the node reference to make it work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I’ll update it a bit later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback—updates pushed. Could you take another look?
return false, nil | ||
} | ||
|
||
// make sure local node's Unschedulable state is updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have an OCPBUGS ticket to track this? To which versions does this apply?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A related bug may not have been created yet. And we encountered this issue on version 4.19, so I expect it to be fixed in 4.19 and later versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you like me to create a OCPBUGS ticket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please, cause our QE would need to test it + our automation requires it to include it in release branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Pablo, ticket OCPBUGS-62341 has been created: https://issues.redhat.com/browse/OCPBUGS-62341
Please review when convenient. Thank you!
f049755
to
75aef33
Compare
Signed-off-by: cz21 <[email protected]>
75aef33
to
0c647cc
Compare
@chizhang21: This pull request references Jira Issue OCPBUGS-62341, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/ok-to-test |
/retest |
1 similar comment
/retest |
Verified using IPI on AWS
This is an example of a webhook failing all cordon/uncordon operations: https://github.com/sergiordlr/temp-testfiles/tree/master/webhook_example
If we repeat the steps in a cluster without the fix, the controller is not able to cordon the node after removing the webhook. /label qe-approved |
/jira refresh |
@pablintino: This pull request references Jira Issue OCPBUGS-62341, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: chizhang21, pablintino The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/verified by @sergiordlr as per #5305 (comment) |
@pablintino: This PR has been marked as verified by In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@chizhang21: This pull request references Jira Issue OCPBUGS-62341, which is valid. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@chizhang21: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
1c25a90
into
openshift:main
@chizhang21: Jira Issue Verification Checks: Jira Issue OCPBUGS-62341 Jira Issue OCPBUGS-62341 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Fix included in accepted release 4.21.0-0.nightly-2025-10-02-002727 |
Hi Pablo, could we backport this fix to 4.19.z and 4.20.z? |
/cherry-pick release-4.19 |
/cherry-pick release-4.20 |
@chizhang21: only openshift org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@chizhang21: only openshift org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Sorry @chizhang21, I missed your comment, I'll do the cherry-pick |
/cherry-pick release-4.19 |
@pablintino: new pull request created: #5348 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@pablintino: new pull request created: #5349 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Thank you! |
- What I did
Ensure the node passed to RunCordonOrUncordon comes from the latest updated state
- How to verify it
Trigger a node cordon/uncordon, when the first node's cordon/uncordon request failed, RunCordonOrUncordon will skip invoke PatchOrReplaceWithContext and return nil, the MCC will hangs to wait.
With this code, MCC will work as excepted.
- Description for the changelog
CordonHelper.PatchOrReplaceWithContext() will always set the node's Unschedulable state to desired, even if the PATCH/UPDATE request failed